In recent years, the use of Multilevel Regression with Post-stratification (MRP) models has gained traction as a powerful tool for predicting election outcomes. By combining individual-level survey data with census information, MRP allows for more accurate estimates of vote shares at the constituency level. This method has shown promising results in forecasting elections, providing valuable insights into voter behavior and preferences.
This analysis delves into the application of MRP to predict vote share estimates for the 2019 UK General Election. Section 1 explore the steps involved in constructing the post-stratification frame, following which Section 2 and 3 models party vote intentions and estimates voter turnout respectively. Section 4 weights predictions with the post-stratification step of MRP, and the following section scaling estimates to match true election results. Additionally, Section 6 evaluate the performance of the MRP model and discusses further improvements.
Section 1. Creating the post-stratification frame
Post-stratification data was taken from the Office for National Statistics (ONS) website [1].
I selected the population type “All usual residents” since they make up people who are usually resident in England or Wales and it is the main population base for Census statistics. The Area Type was specified as “Westminister Parliamentary Constituencies” which include 573 constituencies in England and Wales.
(Note: As of 2024 there are now 575 constituencies in England and Wales but the analysis focuses on 2019 elections and thus the number of constituencies were identical to those in previous General Elections since 2005 for Scotland and 2010 for the rest of the UK).
The variables chosen for post-stratification were age, highest level of education, and sex. For each constituency, there are 70 different demographic groups based on these variables.
The data was downloaded as a CSV file and loaded into R.
Cleaning data
# Load the data
poststrat <- read.csv("ons.csv")
# Rename the columns
colnames(poststrat) <- c("ccode", "cname", "age_code", "age_cat", "edu_code","edu_labels", "sex_code","sex_cat", "count")
## age_cat
# Transform age categories to factor variables
poststrat$age_cat <- factor(poststrat$age_cat)
# Remove rows with "Aged 15 years and under"
poststrat <- poststrat %>% filter(age_cat != "Aged 15 years and under")
## edu_code
# Recode education categories
poststrat <- poststrat %>%
mutate(edu_cat = case_when(
edu_code == "0" ~ "No qualifications",
edu_code == "1" ~ "Level 1",
edu_code == "2" ~ "Level 2",
edu_code == "3" ~ "Level 3",
edu_code == "4" ~ "Level 4 and above",
TRUE ~ "Other"
))
poststrat$edu_cat <- factor(poststrat$edu_cat)
## sex_cat
# Make dummy variable for sex
poststrat <- poststrat %>% mutate(female=if_else(sex_cat == "Female",1,0))
# Add variable called perc that groups the count by constituencies (ccode) and calculates the percentage of each group
poststrat <- poststrat %>%
group_by(ccode) %>%
mutate(perc = count/sum(count)*100) %>% ungroup()
# Keep only the columns needed for post-stratification
poststrat <- poststrat %>% select(ccode, cname, age_cat, edu_cat, female, count, perc)
## There are 573 parliamentary constituencies in the post-stratification frame.
# Make a table with ccode and cname
constituencies <- poststrat %>% select(ccode, cname) %>% distinct()
Section 2. Estimating support for each party
2.1. Loading the survey data
The survey data is from Wave 17 of the 2014-2023 British Election
Study (BES) Internet Panel [2]. 34,366 respondents took wave 17 of the
British Election Study and was conducted by YouGov between 1st November
2019 and 12th November 2019. This wave was chosen as it was the latest
pre-election survey. The survey data was downloaded as an SPSS file and
loaded into R using the haven package.
Variables chosen:
pcon_code: Parliamentary constituency codepcon: Parliamentary constituency namegeneralElectionVote: Vote intention in the 2019 General Electionage: Agep_education: Highest level of educationgender: Gender (Male or Female)
Cleaning data
The variables were cleaned and recoded to match the
post-stratification frame. The outcome variable
generalElectionVote was recoded to have four categories:
Conservative, Labour, Liberal Democrat, and Others. Dummies were created
for each party for the multi-level regression analysis in section 3.
# Load the survey data
bes <- read_sav("bes2019.sav")
# Select variables of interest
bes <- bes %>% select(pcon_code, generalElectionVote, p_education, age, gender)
## pcon_code
# Remove respondents with missing constituencies or who belong to constituencies outside of England and Wales
bes <- bes %>% filter(pcon_code %in% poststrat$ccode)
# Checking if all 573 constituencies are represented in the survey data
# all(unique(bes$pcon_code) %in% p$ccode)
# Rename pcon_code to ccode for consistency
bes <- bes %>% rename(ccode = pcon_code)
# Make ccode factor
bes$ccode <- as.factor(bes$ccode)
## generalElectionVote
# Recode by making values 4-8 and 11-13 under 9
# The categories for outcome variable are Conservative, Labour, Liberal Democrats, Others
bes <- bes %>%
mutate(vote = case_when(
generalElectionVote == 1 ~ "Conservative",
generalElectionVote == 2 ~ "Labour",
generalElectionVote == 3 ~ "Liberal Democrat",
generalElectionVote %in% 4:13 ~ "Other", # Combining all other parties include SNP, Green Party, UKIP, Brexit Party, etc.
TRUE ~ "NA"
))
# remove rows with NA
bes <- bes %>% filter(vote != "NA")
# Make dummies for vote
bes <- bes %>% mutate(con = if_else(vote == "Conservative", 1, 0),
labour = if_else(vote == "Labour", 1, 0),
libdem = if_else(vote == "Liberal Democrat", 1, 0),
other = if_else(vote == "Other", 1, 0))
## age
# Make age categories
bes <- bes %>%
mutate(age_cat = case_when(
age < 16 ~ "Aged 15 years and under",
age < 25 ~ "Aged 16 to 24 years",
age < 35 ~ "Aged 25 to 34 years",
age < 50 ~ "Aged 35 to 49 years",
age < 65 ~ "Aged 50 to 64 years",
age >= 65 ~ "Aged 65 years and over",
TRUE ~ "NA"
))
# Make age_cat a factor variable
bes$age_cat <- factor(bes$age_cat)
## p_education
# Recode education categories
bes <- bes %>%
mutate(edu_cat = case_when(
p_education == 1 ~ "No qualifications",
p_education == 8 ~ "Level 1",
p_education %in% c(5,9,10) ~ "Level 2",
p_education %in% c(6,7,11,12) ~ "Level 3",
p_education %in% 13:17 ~ "Level 4 and above",
p_education %in% 19:20 ~ NA, # Coded "Don't know" and "Prefer not to say" as NA
TRUE ~ "Other"
))
# Remove NAs from edu_cat
bes <- bes %>% filter(!is.na(edu_cat))
# Make edu_cat a factor variable
bes$edu_cat <- factor(bes$edu_cat)
## gender
# Make dummy variable for gender
bes <- bes %>% mutate(female = ifelse(gender == 2, 1, 0))
# Keep variables of interest
bes <- bes %>% select(ccode, con, labour,libdem, other, edu_cat, age_cat, female)
2.2. Modelling party vote intention
A multilevel logistic model will be fitted for each party vote (Conservative, Labour, Liberal Democrat, Other parties) separately. The intuition behind using a multilevel logistic model is that the data is nested within constituencies. We have survey data at two levels: individual-level data containing characteristics such as age, gender, education, and vote intention dummies; and second-level data consisting of geographic areas corresponding to the respondents’ constituencies. Harnessing partial pooling will allow us to borrow strength from the data to estimate the party vote intention for each constituency. The level 1 model relates the vote intention to individual characteristics, while the level 2 model includes intercepts for each unit, in this case, constituencies. Thus, the model will account for the clustering of individuals within constituencies and hence provide more accurate estimates of party vote intentions.
con_vote <- glmer(formula = con ~ female + age_cat + edu_cat + (1 | ccode), family = binomial(link = "logit"), nAGQ=0, data = bes)
labour_vote <- glmer(formula = labour ~ female + age_cat + edu_cat + (1 | ccode), family = binomial(link = "logit"), nAGQ=0, data = bes)
libdem_vote <- glmer(formula = libdem ~ female + age_cat + edu_cat + (1 | ccode), family = binomial(link = "logit"), nAGQ=0, data = bes)
other_vote <- glmer(formula = other ~ female + age_cat + edu_cat + (1 | ccode), family = binomial(link = "logit"), nAGQ=0, data = bes)
2.3. Predicting probabilities of voting for each party
The next step is to produce a predicted probability of voting for
each party for every demographic type in every constituency. This is
done through the predict_prob function, which takes the
above models as inputs and generates predictions for the observations
contained in the dataset poststrat which is all the
possible combinations of demographic types in each constituency.
# Function to predict probabilities
predict_prob <- function(model) {
return(predict(model, newdata = poststrat, type = "response", allow.new.levels = TRUE))
}
# Predict probabilities for all parties
poststrat$con_pred <- predict_prob(con_vote)
poststrat$labour_pred <- predict_prob(labour_vote)
poststrat$libdem_pred <- predict_prob(libdem_vote)
poststrat$other_pred <- predict_prob(other_vote)
Section 3. Estimating voter turnout
Not all respondents who say they’re going to vote actually do. Hence, our model must factor turnout forecasting. Specifically, while we have predicted the probabilities of voting for each party, we also need to estimate the differential turnout rates for each demographic type within each constituency. This is done with the 2019 British Election Study Post-Election Random Probability Survey [3], which contains information on whether respondents voted in the 2019 election, as well as their geographic and demographic characteristics.
3.1. Loading the random probability survey data
Variables chosen:
b01: Electoral Behaviour; whether respondent voted in 2019 electionConstit_Code: Parliamentary constituency codeConstit_Name: Respondent’s constituency nameAge: Ageeducation: Highest level of educationy09: Gender
Cleaning data
As before, the data is cleaned and recoded to match the
post-stratification frame. The outcome variable b01 was
recoded to voted to indicate whether or not a respondent
voted in the 2019 election.
# Load the data
rps <- read_sav("bes-rps.sav")
# Keep variables of interest
rps <- rps %>% select(Constit_Code, Constit_Name, b01, y09, Age, education)
## Constit_Code
# Remove respondents with missing constituencies or who belong to constituencies outside of England and Wales
rps <- rps %>% filter(Constit_Code %in% poststrat$ccode)
#length(unique(rps$Constit_Code)) # 363, not all constituencies represented in the random probability survey
# Rename pcon_code to ccode for consistency
rps <- rps %>% rename(ccode = Constit_Code)
# Make ccode factor
rps$ccode <- as.factor(rps$ccode)
# Rename Constit_Name to cname
rps <- rps %>% rename(cname = Constit_Name)
## b01
# Remove negative values (missing values)
rps <- rps %>% filter(b01 >= 0)
# Make dummy variable for voting
rps <- rps %>% mutate(voted = ifelse(b01 == 1, 1, 0))
## Age
# Make age_cat
rps <- rps %>%
mutate(age_cat = case_when(
Age < 16 ~ "Aged 15 years and under",
Age < 25 ~ "Aged 16 to 24 years",
Age < 35 ~ "Aged 25 to 34 years",
Age < 50 ~ "Aged 35 to 49 years",
Age < 65 ~ "Aged 50 to 64 years",
Age >= 65 ~ "Aged 65 years and over",
TRUE ~ "NA"
)) %>%
relocate(age_cat, .after = Age)
# Remove NAs
rps <- rps %>% filter(age_cat != "NA")
# Make age_cat factor
rps$age_cat <- as.factor(rps$age_cat)
## education
# Recode education categories
rps <- rps %>%
mutate(edu_cat = case_when(
education == 0 ~ "No qualifications",
education %in% c(12,14) ~ "Level 1",
education %in% c(10,13) ~ "Level 2",
education %in% c(6,7,9) ~ "Level 3",
education %in% 1:5 ~ "Level 4 and above",
education < 0 ~ NA, # Coded "Don't know", "Not stated", and "Prefer not to say" as NA
TRUE ~ "Other"
))
# Remove NAs
rps <- rps %>% filter(!is.na(edu_cat))
## y09
# Keep only male and female for comparison with other data
rps <- rps %>% filter(y09 %in% c(1, 2))
# Make dummy
rps <- rps %>% mutate(female = ifelse(y09 == 2, 1, 0))
# Keep variables of interest
rps <- rps %>% select(ccode, cname, voted, edu_cat, age_cat, female)
3.2. Modelling voter turnout
The multilevel logistic model will indicate the direction and strength of the associations between the predictor variables and the likelihood of voting based on the demographic characteristics of the respondents using the random probability survey data. The outcome variable is a binary variable indicating whether the respondent voted or not. The predictors are age and highest level of education, and sex (female == 1). The model will account for the clustering of respondents within constituencies.
# Run the model
voted_model <- glmer(formula = voted ~ female + age_cat + edu_cat + (1 | ccode), family = binomial(link = "logit"), nAGQ=0, data = rps)
3.3. Predicting probabilities of voting
Similar to 2.3, the predict_prob function is used to
produce a predicted probability of voter turnout for each cell in the
post-stratification frame poststrat, based on the fitted
multilevel logistic regression model voted_model.
# Predicting probabilities
poststrat$turnout <- predict_prob(voted_model)
Section 4. Post-stratification
4.1 Calculate weighted
predictions based on perc and
turnout
The predictions from 2.3 are now weighted by the proportion of people
in each constituency that fall under these categories, or
perc, in the post-stratification frame
poststrat, as well as turnout. Using the
weight_pred function will therefore allow us to calculate
the weighted predictions for each demographic group’s vote share based
on both their percentage in the area’s population, and their probability
of actually voting as well.
# Function to calculate weighted predictions
weight_pred <- function(prediction) {
return(prediction * poststrat$perc * poststrat$turnout)
}
# Calculate weighted predictions for all parties and add to post-stratification frame
poststrat$con_weight.pred <- weight_pred(poststrat$con_pred)
poststrat$labour_weight.pred <- weight_pred(poststrat$labour_pred)
poststrat$libdem_weight.pred <- weight_pred(poststrat$libdem_pred)
poststrat$other_weight.pred <- weight_pred(poststrat$other_pred)
Section 5. Scaling estimates
In the final section, we will scale the estimates to match the true results of the election. In order to do this, the first step is to post-stratify the weighted predictions by constituency code. This will allow us to estimate the vote share for each party in each constituency. Afterwards, a scale factor can be calculated by taking the ratio of the true vote share to the estimated vote share for each party. Finally, the estimates can be scaled by multiplying the weighted predictions calculated in 4.1 by the scale factor for each party.
5.1. Function to post-stratify by constituency code
Post-stratification by constituency code is done with the
post_stratify function which adds up weighted predictions
for each area to get estimated vote share for each party. After
post-stratifying for all parties, the results are merged to get
estimates, a data frame with the estimated vote share for
each party in each constituency.
# Function to post-stratify by constituency code
post_stratify <- function(data, weight_col) {
return(data.table(data)[ , .(final_est = sum(get(weight_col))), by = .(ccode)])
}
# Post-stratify results for all parties
party_names <- c("con", "labour", "libdem", "other")
results_list <- lapply(party_names, function(party) {
results <- post_stratify(poststrat, paste0(party, "_weight.pred"))
results <- merge(constituencies, results, by = "ccode", all.x = TRUE) # Merge with constituency data to get names
setnames(results, "final_est", paste0(party, "_est")) # Rename the final estimate column
return(results)
})
# Merge all results by constituency code and name
estimates <- Reduce(function(x, y) merge(x, y, by = c("ccode", "cname"), all.x = TRUE), results_list) # Merge all results
5.2. Scaling estimates with true results
Using the 2019 BES Constituency Results with Census and Candidate Data [4], the true vote share for each party in each constituency is loaded and merged with the estimated vote share from the post-stratification. Variables chosen:
ONSConstID: Constituency codeConstituencyName: Conservative Party vote shareCon19: Conservative Party vote shareLab19: Labour Party vote shareLD19: Liberal Democrats vote shareOther19: Other parties vote share
Load and clean the 2019 general election results
# Load the results from the 2019 general election
results <- read_sav("election-results.sav")
# Create a new variable Others19 that sums the vote share of all other parties
results$Others19 <- rowSums(results[, 11:16], na.rm = TRUE)
# Extract the variables of interest and rename
results <- results %>%
select(ONSConstID, ConstituencyName, Con19, Lab19, LD19, Others19) %>%
rename(ccode = ONSConstID, cname = ConstituencyName, con_true = Con19, labour_true = Lab19, libdem_true = LD19, other_true = Others19)
## ccode
# Remove respondents with missing constituencies or who belong to constituencies outside of England and Wales
results <- results %>% filter(ccode %in% poststrat$ccode)
# length(unique(results$ccode)) # All 573 constituencies are present
# Merge the estimates with the 2019 election results
est_true_results <- results %>%
merge(estimates, by = c("ccode", "cname"), all.x = TRUE)
Scaling the estimates
Scaling factors are calculated for each party by dividing the true vote share by the estimated vote share. These scaling factors are then used to scale the estimated vote shares for each party in each constituency.
# Calculate Scaling Factors
scaling_factors <- est_true_results %>%
mutate(
con_sf = con_true / con_est,
labour_sf = labour_true / labour_est,
libdem_sf = libdem_true / libdem_est,
other_sf = other_true / other_est
) %>%
select(ccode, con_sf, labour_sf, libdem_sf, other_sf)
# Merge with post-stratification frame and scale the estimates
final_poststrat <- poststrat %>%
left_join(scaling_factors, by = "ccode") %>%
mutate(
con_scaled = con_weight.pred * con_sf,
labour_scaled = labour_weight.pred * labour_sf,
libdem_scaled = libdem_weight.pred * libdem_sf,
other_scaled = other_weight.pred * other_sf
)
# Remove the scaling factors
final_poststrat <- final_poststrat %>% select(-con_sf, -labour_sf, -libdem_sf, -other_sf)
# Reorder columns
final_poststrat <- final_poststrat %>% relocate(turnout, .after = perc)
Cleaning output and saving tables
# Create a named vector with variable descriptions
var.labels <- c(
ccode = "Constituency code",
cname = "Constituency name",
age_cat = "Age category",
edu_cat = "Education category",
female = "Sex (0 = Male, 1 = Female)",
count = "Number of respondents",
perc = "Percentage of respondents in constituency",
turnout = "Estimated turnout in constituency (from multilevel model)",
con_prediction = "Predicted Conservative vote share (from multilevel model)",
labour_prediction = "Predicted Labour vote share (from multilevel model)",
libdem_prediction = "Predicted Labour Democrat vote share (from multilevel model)",
other_prediction = "Predicted vote share for other parties (from multilevel model)",
con_weight.pred = "Weighted Conservative vote share estimate",
labour_weight.pred = "Weighted Labour vote share estimate",
libdem_weight.pred = "Weighted Labour Democrat vote share estimate",
other_weight.pred = "Weighted vote share estimate for other parties",
con_scaled = "Scaled Conservative vote share estimate",
labour_scaled = "Scaled Labour vote share estimate",
libdem_scaled = "Scaled Labour Democrat vote share estimate",
other_scaled = "Scaled vote share estimate for other parties"
)
# Assign variable descriptions to dataset
label(final_poststrat) = lapply(names(final_poststrat), function(x) var.labels[match(x, names(var.labels))])
# Save dataframes as csv
write.csv(final_poststrat, "final_poststrat.csv", row.names = FALSE)
write.csv(constituencies, "constituencies.csv", row.names = FALSE)
write.csv(est_true_results, "est_true_results.csv", row.names = FALSE)
Section 6. Results and Evaluating the MRP model
6.1. Getting Mean Absolute Error (MAE) for each party
## Conservative MAE: 17.81961
## Labour MAE: 14.29525
## Liberal Democrat MAE: 5.196204
## Other Parties MAE: 7.223156
The model doesn’t perform too well in predicting the vote share for the Conservative and Labour parties. On average, the model is getting the Conservative result wrong by 17.8 percentage points in each constituency, and the Labour result wrong by 14.3 percentage points. The model performs better for Liberal Democrats and other parties, with mean absolute errors of 5.1 and 7.2 percentage points, respectively.
6.2. How correctly does the MRP model predict wins?
Number of correctly-predicted constituencies: 463
Number of wrongly-predicted constituencies: 110
# Open party_wins csv
party_wins <- read.csv("party_wins.csv")
party_wins$diff <- party_wins$Predicted.Wins - party_wins$Actual.Wins
# Rename columns
colnames(party_wins) <- c("Party", "Predicted Wins", "Actual Wins", "Difference")
# display table
party_wins
6.3. Does the model underestimate or overestimate wins?
6.5. Discussions and further improvement
My analysis has highlighted the promising potential of the MRP approach as a reliable method for predicting vote share estimates in the context of the UK General Election, and we can see why it has successfully predicted elections in the past. Through partial-pooling and weighting predictions by a demographic groups’ percentage in a constituency as well as its probability of turnout, we can derive powerful estimates of vote shares. My model predicts the winner for 463 constituencies correctly, a large proportion despite using only 3 demographic variables to construct my post-stratification frame.
The MRP model is not without its caveats. The model’s performance in predicting the vote share of each party varies. Furthermore, the analysis only takes complete cases, and ignores “don’t know” responses, potentially removing valuable information. More advanced MRP modelling can factor in these responses, but this would require more complex imputation methods, thus the researcher must factor a trade-off between model complexity and accuracy. Since some values are NA in the results dataset for some constituencies and affiliated parties, the scaled estimates for some cells in the final post stratification frame are also NA though there are weighted predictions for these groups.
The model’s performance in predicting the vote share of each party also varies, and it overestimates Conservative wins and underestimates Labour wins. This discrepancy may stem from various factors, including sampling biases or the model’s sensitivity to specific predictors. Further refinement and validation are essential to enhance the model’s predictive accuracy and mitigate these discrepancies.
Further improvements can be made since this is just a simple MRP model that uses only a few variables to predict the vote share of each party. The above analysis did not factor how significant gender or age is in predicting vote intention. Thus, by excluding non-significant variables and replacing them with more significant ones, the model’s complexity can be reduced, and its accuracy can be improved. The model can be improved by adding more bells and whistles - these include more individual-level variables such as an individual’s vote in the previous election, and constituency-level variables such as the 2017 election result, population density, and poverty rate. Additionally, the model can be further improved by allowing the effects of individual-level variables to vary by constituency.